Porting a Stochastic Part-of-Speech Tagger to Swedish

Author

  • Douglas R. Cutting
Abstract

The Xerox Part-of-Speech Tagger (XPOST) claims to be practical. One aspect of practicality as defined here is reusability. Thus it is meant to be easy to port XPOST to a new language. To test this, XPOST was ported to Swedish. This port is described and evaluated.

In previous work on part-of-speech tagging, a practical part-of-speech tagger was defined as one with the following set of properties (Cutting et al. 1992):

  • accurate: A tagger should assign the correct part of speech to every word in the text.

      A      tagger  should  assign  the  correct  part  of    speech  to      every  word  in      the  text
      det/2  n       modal   v       det  adj/2    n/2   prep  n       prep/2  det    n     prep/4  det  n

    While 100% accuracy is desirable, it may not in fact be achievable. When text is manually tagged by several linguists, the tags assigned differ by a few percent, suggesting an effective upper bound for tagging accuracy (Church 1989).

  • fast: Ideally, the addition of part-of-speech tagging to a system will not significantly alter the speed with which text is processed. This may be difficult to evaluate, as systems which incorporate tagging may not operate at all without tagging. As a surrogate, one may compare the cost of assigning tags with that of simply extracting words from text, that is, tokenization. If tagging is not significantly slower than tokenizing, then its performance impact on complex text processing systems should certainly be minimal.
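The speed criterion in the abstract, comparing the cost of tagging with the cost of tokenization alone, can be illustrated with a small timing sketch. Everything in it is an assumption made for illustration: `tokenize`, `tag`, `LEXICON`, `best_time`, and `tagging_overhead` are hypothetical names, and the lookup tagger is only a stand-in for a stochastic tagger, not XPOST or its algorithm.

```python
import re
import time

# Hypothetical tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Hypothetical toy lexicon; a real stochastic tagger would use a trained model.
LEXICON = {
    "a": "det", "the": "det", "every": "det",
    "should": "modal", "assign": "v", "correct": "adj",
    "of": "prep", "to": "prep", "in": "prep",
}


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def tag(tokens):
    """Stand-in tagger: lexicon lookup, defaulting to noun ('n')."""
    return [(tok, LEXICON.get(tok.lower(), "n")) for tok in tokens]


def best_time(fn, arg, repeats=5):
    """Return the fastest wall-clock time of fn(arg) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best


def tagging_overhead(text, repeats=5):
    """Ratio of (tokenize + tag) time to tokenize-only time."""
    t_tokenize = best_time(tokenize, text, repeats)
    t_tag = best_time(lambda s: tag(tokenize(s)), text, repeats)
    return t_tag / t_tokenize


if __name__ == "__main__":
    sentence = ("A tagger should assign the correct part of speech "
                "to every word in the text. ")
    ratio = tagging_overhead(sentence * 2000)
    print(f"tagging / tokenizing time ratio: {ratio:.2f}")
```

A ratio close to 1 would mean tagging adds little on top of tokenization. The figure printed here says nothing about XPOST itself, since the tagger above is only a placeholder.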

Similar resources

Implementing an Efficient Part-Of-Speech Tagger

An efficient implementation of a part-of-speech tagger for Swedish is described. The stochastic tagger uses a well-established Markov model of the language. The tagger tags 92% of unknown words correctly and up to 97% of all words. Several implementation and optimization considerations are discussed. The main contribution of this paper is the thorough description of the tagging algorithm and th...

The Open Source Tagger HunPoS for Swedish

HunPoS, a freely available open source part-of-speech tagger—a reimplementation of one of the best performing taggers, TnT—is applied to Swedish and evaluated when the tagger is trained on various sizes of training data. The tagger’s accuracy is compared to other data-driven taggers for Swedish. The results show that the tagging performance of HunPoS is as accurate as TnT and can be used effici...

Big is beautiful: Bootstrapping a PoS tagger for Swedish

A statistical part-of-speech tagger trained on a one-million word Swedish corpus with validated tags was used to tag two considerably larger untagged corpora (≈ 78 and 20 million words, respectively) to bootstrap new, improved, tagger models. The new taggers all showed better accuracy both for seen and unseen words, and the best tagger had 97.02% overall accuracy evaluated on the original corpu...

Part-of-speech tagging for Swedish

This paper describes the work with a part-of-speech tagger for Swedish. The tagger used in the work was originally designed by Brill (1992) and may be adapted to different languages using annotated training corpora. The training corpus in this case is very small and may be the reason why the tagger is not very accurate in its original form. Extending the lexicon using different methods has enha...

Chinese Information Extraction and Retrieval

1. what was learned from porting the INQUERY information retrieval engine and the INFINDER term finder to Chinese, 2. experiments at the University of Massachusetts evaluating INQUERY performance on Chinese newswire (Xinhua), 3. what was learned from porting selected components of PLUM to Chinese, 4. experiments evaluating the POST part of speech tagger and named entity recognition on Chinese. 5....

Publication date: 1993